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Abstract 


Great ape clades exhibit variation in the relative mutation rates of different three-base-pair genomic motifs, with closely related 
species having more similar mutation spectra than distantly related species. This pattern cannot be explained by classical demo- 
graphic or selective forces, but imply that DNA replication fidelity has been perturbed in different ways on each branch of the great 


ape phylogeny. Here, we use whole-genome variation from 88 great apes to investigate whether these species’ mutation spectra 


are 


broadly differentiated across the entire genome, or whether mutation spectrum differences are driven by DNA compartments that 


have particular functional features or chromatin states. We perform principal component analysis (PCA) and mutational signa 
repeat content, f 
compartments the spectra are ascertained from. At the same time, we find that many compartments have their own character 
mutational signatures that appear stable across the great ape phylogeny. For example, in a mutation spectrum PCA compartm 


ture 
deconvolution on mutation spectra ascertained from compartments defined by features including replication timing and anci 
inding evidence for consistent species-specific mutational signatures that do not depend on which functional 


ent 


istic 
en- 


talized by replication timing, the second principal component explaining 21.2% of variation separates all species’ late-replicating 








regions from their early-replicating regions. Our results suggest that great ape mutation spectrum evolution is not driven by ep 


ge- 


netic changes that modify mutation rates in specific genomic regions, but instead by trans-acting mutational modifiers that affect 


mutagenesis across the whole genome fairly uniformly. 


Key words: mutational signatures, chromatin landscape, mutation spectrum, great ape evolution, replication timing, 
hydroxymethylation. 


Significance 


All heritable variation begins with damage or copying mistakes affecting the DNA of sperm, eggs, or embryos. 
Different DNA motifs can have different mutation rates, and these rates can evolve over time: the spectrum of 
mutability of three-base-pair motifs has evolved rapidly during great ape diversification. Here, we show that even 
as ape mutation spectra diverged from each other, ape genomes preserved a landscape of spatial mutation spectrum 
variation. We can thus deconvolute the mutational process into a mixture of fast-evolving signatures with uniform 
spatial distributions and conserved signatures that target specific regions. Our findings may ultimately help determine 
the factors, either genetic or environmental, that contribute to temporal and spatial variation in germline mutagenesis. 
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Introduction 


The pace of evolution and the healthspan of somatic tissue 
are both ultimately limited by the genomic mutation rate, 
which is a complex function of DNA damage susceptibility, 
polymerase fidelity, proofreading efficacy, and other 
factors (Alexandrov et al. 2013; Sima and Gilbert 2014). 
Some regions of the genome accumulate mutations faster 
than others, such as DNA that replicates late in the cell 
cycle and motifs with certain epigenetic modifications 
(Stamatoyannopoulos et al. 2009; Koren et al. 2012; 
Schuster-Bockler and Lehner 2012; Liu et al. 2013; Sima 
and Gilbert 2014; Polak et al. 2015). Such mutation rate 
differences have the potential to confound efforts to infer 
patterns of purifying selection and background selection 
(Hellmann et al. 2005; Martincorena et al. 2012). As a result, 
understanding how mutation rate varies across the genome 
is an important prerequisite for identifying modes and targets 
of natural selection (Hellmann et al. 2003; Kulathinal 
et al. 2008; Keightley et al. 2011). Understanding the muta- 
tional landscape is similarly essential for predicting rates 
of deleterious de novo mutations (DNMs) in clinically relevant 
disease genes (Michaelson et al. 2012; Veltman and Brunner 
2012). 

Some variation of mutation rate across the genome appears 
to be driven by features such as chromatin state, transcription, 
or, more broadly, genomic function (Ananda et al. 2011; Liand 
Luscombe 2020). For the purpose of this manuscript, we 
broadly consider a genomic compartment’s “function” to be 
a set of shared molecular interactions that might or might not 
be critical to organismal fitness. However, even with this broad 
definition, a large component of the variance of the mutational 
landscape still escapes association with any known motifs or 
functions and has thus been classified as “cryptic” 
(Hodgkinson and Eyre-Walker 2011; Johnson and Hellmann 
2011; Terekhanova et al. 2017). This cryptic variation is con- 
served over relatively long timescales, being highly similar be- 
tween human and macaque, which diverged ~25 million years 
ago (mya) (Tyekucheva et al. 2008). This conservation suggests 
that many regions of the genome have intrinsic or epigenetic 
features that affect their mutability and are functionally impor- 
tant enough to be maintained over time despite creating ex- 
cess deleterious mutation load. 

One powerful tool for disentangling the effects of selection 
and mutation rate variation is mutation spectrum analysis: a 
comparison of the relative abundance of specific types of 
mutations, often defined by their triplet context (ie., 
AAA>ACA or ACG>ATG) (Hwang and Green 2004). 
When background selection removes genetic variation from 
a genomic region, it removes variants that are essentially sam- 
pled at random from the spectrum of variation that is present. 
Biased gene conversion can affect the mutation spectrum, but 
only in a specific way, favoring the retention of mutations 
from A/T to G/C over mutations from G/C to A/T (Galtier et 


al. 2001). A much broader variety of mutation spectrum 
changes can occur when the mutation rate increases as a 
result of alteration to the mechanisms of DNA damage or 
repair, most famously in cancer where cells’ housekeeping 
processes often break down (Alexandrov et al. 2013). For 
example, tumors that replicate their DNA with a defective 
polymerase ¢ accumulate high rates of TCT>TAT and 
TCG>TTG mutations (Alexandrov et al. 2013; Shinbrot et 
al. 2014). Similar “mutational signatures” also occur in the 
normal human germline, where late-replicating DNA consis- 
tently accumulates proportionally more C>A and A>T muta- 
tions compared with DNA that replicates earlier during the cell 
cycle (Agarwal and Przeworski 2019). 

In addition to varying between regions of the genome, 
mutation rates and spectra also vary between different evo- 
lutionary lineages. Patterns of diversity point to a global mu- 
tation rate slowdown during hominoid evolution that has 
caused humans and other closely related apes to accumulate 
mutations more slowly than distantly related monkeys do; this 
slowdown has been studied since the 1980s (Goodman 1985; 
Li and Tanimura 1987; Scally and Durbin 2012). A closer ex- 
amination of ape mutation spectra recently revealed that ev- 
ery ape lineage has experienced changes in the relative 
mutation rates of some characteristic triplet motifs (Harris 
and Pritchard 2017). Even more surprisingly, closely related 
human populations have distinctive mutation spectra that 
provide enough information to classify individuals into conti- 
nental ancestry groups (Harris and Pritchard 2017). Sometime 
during the 100,000 years since their descendants migrated 
out of Africa, Europeans experienced a temporary pulse of 
mutagenic activity that more than doubled the rate of 
TCC>TTC mutations (Harris 2015; Speidel et al. 2019; 
DeWitt et al. 2021). 

Mutation spectrum divergence has the potential to shed 
light on both the mechanisms and overall speed of mutation 
rate evolution. In theory, many different biological mecha- 
nisms can cause mutation rates to evolve; these may be 
changes to cellular machinery (e.g., changes to a DNA repair 
protein) or to environmental/life history traits (e.g., longer 
generation time) (Shinbrot et al. 2014; Gao et al. 2019). In 
the event that all mutagenic processes generate diversity in a 
conserved and clocklike manner, mutation spectra are 
expected to stay constant over time. Conversely, if exposure 
to a particular mutagen increases or a DNA repair mechanism 
becomes less efficient, this is expected to elevate the relative 
dosage of mutations in genomic motifs that are most vulner- 
able to damage by that mutagen or preferentially targeted by 
that DNA repair mechanism. Different genetic and environ- 
mental perturbations might cause similar changes in the over- 
all mutation rate, but are less likely to elevate mutation rates in 
the same genomic sequence contexts. 

In this study, we examine how mutation spectra covary 
across the genome and the ape phylogeny and find that great 
ape mutation spectra exhibit similar patterns of mutation 
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spectrum divergence across both slowly and quickly mutating 
compartments of the genome. We find no evidence that 
species-specific mutator activity is correlated with chromatin 
State, ancient repeat content, or replication timing during S 
phase, despite the fact that all of these variables correlate with 
stable mutational signatures that are consistently detectable 
across the great ape phylogeny. This implies that the rapid 
evolution of great ape mutation spectra has likely been driven 
by trans-acting mutators that do not preferentially target any 
specific genomic compartments, at least not compartments 
that are correlated with the variables examined in this study. 
Although there exist many differences between functionally 
divergent compartments in the spectra of mutations that ac- 
cumulate, these differences appear to exhibit considerable 
Stability between great ape species and are not likely respon- 
sible for the rapid mutation spectrum divergence. We find 
that such stability extends to species as divergent as humans 
and mice, where we find consistent mutational signatures 
present in genomic regions that are not homologous but 
are annotated with the same functional states as promoters, 
enhancers, or repressed regions. 





Results 


Quantifying the Mutation Spectrum Differences between 
Great Ape Species and Subspecies 


Previous research utilizing the Great Ape Genome Project 
(GAGP) data showed that the germline mutation spectrum 
has evolved rapidly in great apes, leading to distinct species- 
specific spectra (Prado-Martinez et al. 2013; Harris and 
Pritchard 2017). We first sought to recapitulate these results 
and measure for the first time how the differences between 
species compare to differences within species (supplementary 
table S1, Supplementary Material online). 

To minimize the effects of natural selection and read map- 
ping errors on our mutation spectrum ascertainment, we de- 
fined a set of genomic regions, collectively called a 
“compartment,” characterized as nonconserved and nonrep- 
etitive (NCNR). This NCNR compartment consists of 1.28 Gb of 
the nonrepetitive (annotated by RepeatMasker), noncoding 
human genome excluding both significantly conserved regions 
(P<0.05 in the PhastCons 44-way primate alignment) and 
CpG islands (supplementary table S2, Supplementary 
Material online). In these compartments, we computed the 
relative abundances of each of the 96 triplet mutation types 
for each individual and each species (fig. 1A) using single nu- 
cleotide variants (SNVs) from the GAGP, following a number of 
filters. To ensure that shared genetic drift cannot inflate the 
appearance of mutation spectrum similarity among individuals 
that share many derived alleles, we computed mutation spec- 
tra using asampling procedure that randomly counts each SNV 
toward the spectrum of only one of the individuals that carry it, 
rather than all such individuals (see Materials and Methods). 





This sampling method reduces the impact of GC-biased gene 
conversion, which primarily affects higher frequency alleles in 
regions specific to lineages (see Materials and Methods, sup- 
plementary fig. S1, Supplementary Material online). 

A principal component analysis (PCA) on mutation spectra 
shows clustering of individuals by species in a manner that 
recapitulates phylogeny. This pattern reveals the existence of 
distinct species-specific mutation spectra (fig. 1B). Mutation 
spectra of humans, chimpanzees (Pan troglodytes), and bo- 
nobos (Pan paniscus), which form a phylogenetic clade, sep- 
arate from more distantly related gorillas (Gorilla gorilla) and 
Sumatran and Bornean orangutans (Pongo abelii and Pongo 
pygmaeus, respectively) along principal component 1 (PC1). 
PC2 separates humans from chimpanzees and bonobos in 
addition to separating gorillas from orangutans. The two 
orangutan species cluster closely together, which is unsurpris- 
ing given that their divergence time (~483 kya) is an order of 
magnitude more recent than that of humans and chimpan- 
zees (Prado-Martinez et al. 2013). PC1 appears dominated by 
the proportion of A>C, and A>T and components of C>T 
mutations, whereas PC2 is dominated by the proportion of 
C>A and C>G_ mutations (supplementary fig. S2, 
Supplementary Material online). The major A>C component 
of PC1, for example, corresponds to a 10-15% decrease in 
the fraction of A>C SNVs in gorillas and orangutans relative 
to humans, chimpanzees, and bonobos. PC2's C>A compo- 
nent corresponds to a 4% and 7% increase in the C>A SNV 
fraction in humans and gorillas relative to the Pan and Pongo 
clades, respectively. These results are robust to subsampling of 
equal numbers of individuals across species (supplementary 
fig. S3, Supplementary Material online). Although prior 
papers have reported homogeneity of DNM spectra between 
nonhuman great apes, we find that these studies are under- 
powered to detect species differences of the magnitude we 
observe here (Thomas et al. 2018; Besenbacher et al. 2019) 
(see Materials and Methods, supplementary table S3, 
Supplementary Material online). 

We even observed mutation spectrum differences among 
chimpanzee subspecies, as visible in the PCA: Western chim- 
panzees (P. troglodytes verus) overlap slightly less with bono- 
bos along PC1 than other chimpanzee subspecies, which 
begin to separate from bonobos along PC2. The mutation 
spectrum differences between Western and non-Western 
chimpanzee subspecies are more clearly visualized in a PCA 
run on those individuals alone (supplementary fig. S4, 
Supplementary Material online). The first PC demonstrates 
that the variance in mutation spectra between Western and 
non-Western chimpanzees significantly exceeds the variance 
among non-Western chimpanzees (P< 2.2 * 107'® two- 
sided t-test). Western chimpanzees experienced a population 
bottleneck when they diverged from the lineage ancestral to 
other chimpanzees, which might have accelerated their mu- 
tation spectrum divergence by allowing mutator alleles to drift 
to higher frequency (Kimura 1967; Lynch 2008; Prado- 
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Fic. 1.—Covariance of species-specific and replication timing mutation spectra in great apes. (A) SNVs segregating within a species were counted to 
generate a triplet mutation spectrum for each individual in the GAGP. We include SNVs found in NCNR regions of the genome, which we collectively call the 
NCNR compartment. (B) PCA of NCNR compartment mutation spectra reveals clustering of individuals by species. Each point represents the NCNR 
compartment mutation spectrum from a single individual in the GAGP; colors represent species, whereas shades of a color represent subspecies (applies 
only to gorillas and chimpanzees). (C) Mutation spectra are more similar between individuals of the same species than between individuals from different 
species. We plotted the Euclidean distances between triplet mutation spectra of all possible pairs of individuals in the GAGP within and between species (see 
Materials and Methods). (D) When NMF is used to infer mutational signatures that best explain variation among ape mutation spectra, we see more variation 
of signature composition between species than within species and identify signatures that change in dosage on specific phylogenetic tree branches. For 
example, signature 1 appears to have decreased in dosage on the branch ancestral to humans, chimps, and bonobos. 


Martinez et al. 2013; Sudmant et al. 2013; Lynch et al. 2016) populations exhibit more subtle mutation spectrum differen- 
(supplementary fig. S4, Supplementary Material online). ces that are not visible when spectra are projected onto the 
Gorilla subspecies, orangutan species, and human principal axes of ape variation (Harris and Pritchard 2017) 
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(supplementary fig. S4, Supplementary Material online). To 
quantify these mutation spectrum differences further, we em- 
bedded mutation spectrum histograms as points in a 96-di- 
mensional space and computed distances between the using 
a standard Euclidean metric. As seen qualitatively in the PCA, 
we found that interspecific differences exceeded conspecific 
differences (fig. 1C). Furthermore, interspecific differences 
scaled with divergence time. 

We undertook additional analyses to verify that the ob- 
served mutation spectrum differences are not likely caused 
by the tendency of natural selection to retain variation in cer- 
tain sequence contexts. Although GC-biased gene conversion 
does favor the fixation of G/C alleles and the elimination of A/ 
T alleles, none of the mutational signatures that vary in dos- 
age between great ape species are consistent with the spectra 
of mutations that for which biased gene conversion selects 
(supplementary fig. S2, Supplementary Material online). 
Moreover, when we stratify allele frequency within each spe- 
cies, we see the expected directional effect of biased gene 
conversion (supplementary fig. S5, Supplementary Material 
online). However, rare alleles have spectra that are no more 
similar across species than more common alleles that are 
expected to be more profoundly affected by biased gene con- 
version. Furthermore, any mutation spectrum difference 
caused by the action of natural selection should affect high- 
frequency alleles more than low-frequency alleles, so we re- 
peated several key analyses using only low-frequency variants. 
We performed these replicate analyses using doubletons 
rather than singletons, because singletons are more vulnera- 
ble to confounding by sequencing error, and found no qual- 
itative differences from the results obtained using the full 
frequency spectrum of genetic variants (supplementary fig. 
S6A and B, Supplementary Material online). 

We used non-negative matrix factorization (NMF) to explic- 
itly infer which mutational signatures have changed in dosage 
along different branches of the ape phylogeny (fig. 1D). 
Similar to PCA, NMF is a model-free method used to deter- 
mine major components of variance that underlie a matrix of 
data; the components NMF extracts from mutation spectrum 
matrices can be interpreted as mutation signatures, following 
the work of Alexandrov et al. (2013). We ran NMF on the 
matrix of individual NCNR mutation spectra using Helmsman, 
specifying the model to infer K = 6 signatures (Carlson, Li, et 
al. 2018). Figure 1D shows the dosage of each signature for 
every individual in the GAGP, grouped by species. Although 
each signature is present in every individual, the signatures 
clearly demonstrate lineage-specific dosage. For example, $3 
is present at moderate dosage in all nonhuman species but 
appears to have increased in relative rate in humans following 
the divergence of humans from the human-chimpanzee—bo- 
nobo common ancestor (supplementary fig. 7, 
Supplementary Material online). S4, conversely, has de- 
creased 2-fold on the branch leading to human- 


chimpanzee—bonobo; S2 has decreased nearly 3-fold on the 
branch leading to orangutans. These results support a prior 
hypothesis that the dosage of one or more mutational signa- 
ture has changed along each branch in the great ape phylog- 
eny (Harris and Pritchard 2017). 


A Mutational Signature Associated with DNA Replication 
Timing Is Conserved among Great Apes 


Differences in replication timing explain a substantial portion 
of the variation in somatic and germline mutation rate across 
the genome (Stamatoyannopoulos et al. 2009; Koren et al. 
2012). Compared with regions that replicate early in S phase, 
late-replicating regions tend to have a higher overall mutation 
rate, and in humans, they particularly harbor a higher rate of 
A>T and C>A mutations (Agarwal and Przeworski 2019). 
The established correlation between late-replication timing 
and elevated mutation rate implies that replication timing 
QTLs (rtQTLs) may be examples of cis-acting mutation spec- 
trum modifiers, with late-replication alleles acting to increase 
the load of a late-replicating mutational signature in the sur- 
rounding DNA. We analyzed late- and early-replicating com- 
partments of the genome to determine whether replication 
timing had a similar effect on the mutation spectrum across 
great apes. 

In an attempt to replicate the late-replication signature 
reported by Agarwal and Przeworski, we defined early- and 
late-replication timing compartments to be the earliest- and 
latest-replicating quartiles of the genome identified by 
RepliSeq in human lymphoblastoid cell lines (Koren et al. 
2012) (fig 2A, supplementary table S2, Supplementary 
Material online). We then generated separate mutation spec- 
tra from the early-replicating and the late-replicating compart- 
ments of each individual genome, normalizing the spectra by 
the triplet composition of each compartment. We ran a PCA 
on a matrix containing three mutation spectra derived from 
each human genome in the GAGP: two derived from early- 
and late-replicating compartments and the third derived from 
the NCNR compartment (fig. 2B, supplementary fig. S8, 
Supplementary Material online) and observed that PC1 sepa- 
rated out these spectra as a function of replication timing. To 
determine the principal triplet mutation types driving the dif- 
ferences in mutation spectra between compartments, we 
generated heatmaps of the log odds ratio enrichments of 
each mutation type occurring in the late versus the early- 
replication timing compartments (fig. 2C, supplementary 
fig. S9, Supplementary Material online). For this analysis, we 
counted the number of segregating sites within a species to 
generate a single 96-dimensional vector for each species and 
compartment (rather than a vector for each individual and 
compartment, see Materials and Methods). As expected, 
late-replicating regions were enriched for C>A and A>T 
mutations. 
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Fic. 2.—Signatures associated with replication timing appear conserved among great apes. (A) We calculated separate mutation spectra for each 
individual in compartments that replicate early versus late in S phase and reweighted each spectrum by the trinucleotide content of the associated 
compartment. (B) Each point in this PCA represents the mutation spectrum from a single individual’s NCNR, early-replicating, or late-replicating compart- 
ment. The observed gradient results from differences in mutation spectra between early-replicating and late-replicating compartments. (C) A heatmap of the 
log ratios of triplet mutation fractions in humans shows an enrichment for C>A and A>T mutations in the late-replicating compartment compared with the 
early-replicating compartment. This mutation signature recapitulates recently described late-replication timing signature in humans. To generate the species 
mutation spectra, we counted the number of SNVs with triplet context segregating within a species that occurred in each compartment. The triplet mutation 
fractions were normalized by compartment nucleotide content. (D) Mutation spectra from late- and early-replication timing and NCNR compartments from 
each individual in the GAGP separate along a first PC associated with phylogeny and a second PC associated with replication timing. The mutation signature 
associated with late-replication timing appears conserved among all great apes. Different shades of each species’ color represent subspecies, as in figure 1D. 





After replicating the reported effect of replication timing 
on the human mutation spectrum, we set out to determine if 
the action of this signature appeared conserved among spe- 
cies. To this end, we identified the ape genomic compart- 
ments that aligned to the human early-replicating and late- 
replicating compartments and ran a PCA on a matrix of the 
early-replicating, late-replicating, and NCNR compartment 
mutation spectra of all individuals (fig. 2D). PC1 separates 
the spectra by phylogeny and species identity, whereas PC2 
separates them along an axis that aligns with replication tim- 
ing. PC3 separates human from chimpanzee and bonobo 
spectra (not shown). The direction of separation of early- 


and late-replication timing compartments is similar across all 
species, implying the action of a conserved mutational process 
to a consistent compartment of the genome. The most par- 
simonious explanation for this pattern is that the replication 
timing landscape is largely conserved across great apes and 
that late replication exerts a similar mutagenic effect in all 
great ape species. As above, we obtained consistent results 
using only doubletons (supplementary fig. S6C and D, 
Supplementary Material online). A PCA of individual mutation 
spectra, using only doubletons, recreates similar clustering 
patterns to those presented in figure 2. Furthermore, the sec- 
ond PC again captures the mutational signature of late- 
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replication timing (supplementary fig. S6D, Supplementary 
Material online). The main effect of restricting to rare variants 
is that humans appear less displaced from chimps along the 
axis of differentiation by replication timing. These results sug- 
gest that differences between species in biased gene conver- 
sion or demographic history have a minimal effect on the 
observed trends (supplementary fig. S6, Supplementary 
Material online). 

To further quantify the conservation of the replication tim- 
ing mutational signature, we tested the correlation of the log 
odds of late- versus early-replication compartments between 
each pair of species, thereby quantifying the similarity be- 
tween each species’ late- versus early-replication timing mu- 
tational heatmaps. The correlations between every pair of 
species’ replication timing heatmaps were highly significant 
(supplementary fig. S9 and table S4, Supplementary Material 
online). Our results show that late-replication timing is asso- 
ciated with a conserved mutational signature across great 
apes. Moreover, these mutational patterns suggest that the 
genomic landscape of replication timing is broadly conserved 
across species, biasing all genomes toward C>A and A>T 
mutations in compartments that are directly orthologous to 
the human late-replicating compartment (supplementary fig. 
S10, Supplementary Material online). The conservation of this 
mutation signature contrasts with the rapid evolution of the 
species signatures. We see a few hints of interactions be- 
tween replication timing and species identity—for example, 
CG>GG mutations are depleted in late-replication timing 
regions, and the depletion appears slightly stronger in nonhu- 
man apes than in humans. However, these nonlinear effects 
are small and higher resolution sequencing data will be 
needed to verify whether they are true biological signals. 

The layering of rapidly evolving species signatures over con- 
served replication timing signatures was further supported by 
NMF decomposition. We ran NMF independently for each 
species on matrices of individual mutation spectra from the 
early-replicating and late-replicating compartments, after nor- 
malizing for compartment-specific nucleotide content. Each 
NMF was run to extract K = 7 signatures. Following the meth- 
odology of Alexandrov et al. (2013), we then grouped similar 
signatures by their loadings using hierarchical agglomerative 
clustering (supplementary fig. $11, Supplementary Material 
online). The clustering of these signatures, even following in- 
dependently run NMFs, supports the deep conservation of a 
mutation signature associated with replication timing that 
spans the great ape phylogeny. Furthermore, the loadings 
of these signatures are enriched for C>A and A>T mutation 
types, once again supporting our presented results. 





All Great Ape Genetic Variation Appears to Be Shaped by a 
Conserved Landscape of Cis-Acting Mutational Modifiers 


We found that all ape DNA, regardless of replication timing, 
has a consistent mutation spectrum bias that we call a 


species-specific signature. Late-replicating regions of chim- 
panzee genomes contain a mixture of both a chimpanzee- 
specific signature and the same late-replication signature that 
is found in human genomes (supplementary fig. S9, 
Supplementary Material online). We see little evidence of 
any mutational signature unique to late-replicating chimpan- 
zee DNA that is not also found in early-replicating chimpanzee 
DNA or in late-replicating regions of other ape genomes. 
Furthermore, we see no evidence that species-specific signa- 
tures have a rate or dosage that depends on replication 
timing. 

Using published annotations of the human genome, we 
defined several more overlapping functional compartments to 
characterize the phylogenetic distribution of spatially localized 
mutational signatures. We used RepeatMasker to delineate 
repetitive versus nonrepetitive DNA and used ENCODE 
chromHMM output (the intersection of heterochromatic 
regions in nine cell types) to annotate several types of hetero- 
chromatin (Ernst and Kellis 2012). Another compartment we 
annotated consists of ancient repeats, which have the poten- 
tial to mutate differently from higher complexity DNA via sev- 
eral mechanisms, including the formation of non-B-DNA 
secondary structures and the editing activity of antiviral 
enzymes such as APOBECs (Harris et al. 2002; Bacolla et al. 
2004; Guiblet et al. 2018) (supplementary table S2, 
Supplementary Material online). 

We ran a PCA for each species comparing the mutation 
spectra of all eight compartments and observed similar topol- 
ogies of compartment separation in each species. In all cases, 
the vector separating late-replicating from early-replicating 
compartments is nearly aligned with the first principal compo- 
nent, which explains 23.3-34.1% of the total variance. PC2, 
which explains 8.2-13.5% of the total variance, is similarly 
aligned with the separation between repetitive and nonrepe- 
titive compartments. Finally, PC3 (4.0-8.7% variance 
explained) separates the endogenous retroviruses (ERVs) 
from the other compartments to a much greater extent than 
either of the first PCs (fig. 3A-F, supplementary fig. S12, 
Supplementary Material online). The similarities of the inde- 
pendent PCAs across all species of great apes imply conserva- 
tion of the cis-regulated mutational signatures associated with 
repetitive content, methylation, and replication timing. Each 
compartment shows a similar degree of separation between 
species, with high degrees of correlation between the position- 
ing of compartments in different species’ PCAs (supplementary 
table S5, Supplementary Material online). 

We identified only one genomic compartment whose lo- 
calized mutational signature appears to be distributed nonun- 
iformly across species: maternal mutation hotspots. first 
identified using human trio data Jonsson et al. 2017) (sup- 
plementary table S2, Supplementary Material online). 
Maternal mutation hotspots are genomic regions that are 
enriched for de novo C>G mutations that arise on the ma- 
ternal lineage and whose rate has an unusually strong 
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Fic. 3.—Conserved axes of mutation spectrum variance among great apes. (A-F) We defined eight overlapping functional compartments to test for 
evolution of mutation spectrum modifiers along axes of chromatin accessibility, replication timing, and repetitive content. We then ran a PCA on the 
individual mutation spectra for all eight compartments for each species separately (only human, chimpanzee, and Sumatran orangutan shown). For all 
species, PC1 and PC2 separate compartments along gradients that correspond to replication timing and repetitive content, respectively (dotted lines vs. 
shaded polygons, A-C). PC3 separates ERVs from other compartments (shaded polygons, D-F). The similarities of these independent PCAs across all species 
imply conservation of cis-acting mutational signatures. *Axis inverted for readability. 


correlation with maternal age. These hotspots exist in chim- 
panzees and, to a lesser extent, gorillas, but their signal is 
nearly absent from orangutans. Our mutation PCAs similarly 
show that human genomes have higher levels of a 
compartment-specific mutational signature in these regions 
(supplementary fig. S13, Supplementary Material online). The 
separation of NCNR to maternal mutation hotspot mutation 
spectra compartments decays with phylogenetic distance 
from humans, recapitulating findings from Jonsson et al. 
(2017). Although a trans-acting protein might be involved in 
the creation of C>G mutations at maternal hotspots, perhaps 
due to error-prone repair of double-strand breaks, the target- 
ing of damage toward specific regions fits the profile of a cis- 


acting targeting factor. Either this cis targeting factor or a 
trans interacting partner has been intensifying its mutagenic 
effects along the evolutionary lineage leading to humans, 
causing extra mutations to accumulate in a localized pattern. 
Although differences in maternal age at conception may be a 
partial explanation for these observations, generation time 
differences cannot fully explain the patterns across all great 
apes. The strength of this signature certainly decreases with 
generation time among humans, Pan clade, and gorillas (29, 
24, and 19 years, respectively). However, orangutans have a 
higher average maternal age (25 years) at conception than 
that of gorillas and the Pan clade but exhibit the weakest 
dosage of the C>G signature (Besenbacher et al. 2019). 
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The compartment that harbored a signature whose dosage 
was the strongest among the presently investigated compart- 
ments is CpG islands, genomic regions ranging from 200 to 
2,000 bp long that are enriched for CpG dinucleotides 
(Gardiner-Garden and Frommer 1987; Larsen et al. 1992) 
(supplementary table S2, Supplementary Material online). 
CpG islands are the only compartment we identified whose 
differentiation from the NCNR compartment explains a 
greater proportion of mutation spectrum variance than differ- 
entiation between species across these regions. Outside of 
these CpG islands, CpGs are often methylated to 5-methyl- 
cytosine, which mutate to TpG at a rate ten times higher than 
unmethylated CpGs. In contrast, CpG islands are hypomethy- 
lated and are often situated in conserved 5’ promoters and 
genic regions. We hypothesized that, due to their lack of 
CpG>TpG mutations and overall conservation, a compart- 
ment containing CpG islands would demonstrate a contrast- 
ing mutation spectrum relative to that of the NCNR 
compartment. A PCA of individual mutation spectra from 
the NCNR and a CpG compartment from all GAGP individuals 
demonstrated that differentiation of the CpG island compart- 
ment exceeds the magnitude of spectrum differentiation be- 
tween species (supplementary fig. S14, Supplementary 
Material online). This is unsurprising given the unique muta- 
tional properties of CpG islands compared with the rest of the 
genome. 


Endogenous Retroviruses Carry a Distinct Mutational 
Signature Conserved across Great Apes 


ERVs are a class of repetitive, transposable DNA elements that 
duplicate themselves in a copy and paste manner. The act of 
duplication into new regions of the genome can disrupt func- 
tion; for example, integration into the coding region of a gene 
could result in a complete knock-out of gene function. 
Therefore, ERV activity is restricted by a number of known 
mechanisms, including hypermethylation, inhibition of inte- 
gration, and hypermutation. 

In light of these mechanisms that target ERVs, we were 
intrigued by the fact that ERVs separated from other geno- 
mic compartments along the third principal component of 
our mutation spectrum analysis (4-8% variance explained). 
ERVs bear an excess load of a unique mutational signature 
that appears to be largely conserved among great apes (fig. 
2B, D, and F) and is previously undescribed, to our 
knowledge. 

To determine whether any component of the ERV signa- 
ture could be caused by high rates of methylation and heter- 
ochromatization, we directly compared the mutation 
spectrum of the ERV compartment to that of nonrepetitive 
heterochromatin. We calculated the log-ratio enrichments 
and depletions of the 96 mutation types in ERVs relative to 
nonrepetitive heterochromatin for each species and found an 
enrichment for CoG C>G mutations and a depletion of 


TAA>TTA mutations in ERVs that appears conserved in all 
species other than P. pygmaeus (fig. 4A). This comparison 
shows that ERVs’ high rate of CoG>CpT transitions is likely 
caused by their heterochromatic status, but that heterochro- 
matinization cannot explain the other components of the ERV 
signature. Furthermore, we determined that differences in 7- 
mer nucleotide content between the two compartments 
explained some, but not all, of the ERV-specific enrichment 
for CG>GG mutation types (supplementary fig. S15, 
Supplementary Material online). 

We hypothesized that the CoG C>G mutational signature 
could result from the high and variable rates of CpG hydrox- 
ymethylation (hmC) of ERVs, which has been recently shown 
to increase rates of C>G mutations (Supek et al. 2014). To 
test this hypothesis, we compared the mutation spectra of 
ERVs with versus without evidence of hmC CpG, based on 
hmC-specific sequencing of human embryonic stem cells (Yu 
et al. 2012) (supplementary table $2, Supplementary Material 
online). ERVs with hmC showed a significant enrichment for 
CpG C>G mutations compared with ERVs without hmC in all 
six species, supporting the hypothesis of hmC-related muta- 
genesis in ERVs (fig. 48, supplementary table ‘6, 
Supplementary Material online). We assessed the robustness 
of the CoG C>G mutational signature to differences in map- 
ping quality, compartment size, and species-specific nucleo- 
tide content, finding that the CoG C>G mutational signature 
was robust to all quality control tests (fig. 4C, supplementary 
figs. S16 and S17, Supplementary Material online). 


Conserved Mutational Signatures Are Associated with 
Functional Genomic Elements in Distantly Related 
Mammalian Species 


After observing that compartment-associated mutational sig- 
natures appear to be conserved among great ape species, we 
hypothesized that such conservation might extend to even 
more distantly related species and tested this hypothesis using 
annotations of epigenetic function that were generated by 
the ENCODE project for humans and mice. Specifically, we 
split the human and mouse genomes into six different func- 
tional compartments based on chromHMM annotations gen- 
erated from epigenetic assays run in ESCs and mESCs, 
respectively (Ernst and Kellis 2012; Pintacuda et al. 2017) (sup- 
plementary table $7, Supplementary Material online). These 
chromHMM-defined compartments do not necessarily corre- 
late with the various compartments we analyzed across great 
apes. We posited more generally that mutational signatures 
associated with specific compartments that experience spe- 
cific molecular interactions (DNA-binding proteins, histone 
modifications, chromatin state, e.g.) are relatively stable 
among similar genomic regions in different species. Across 
these compartments, we calculated normalized mutation 
spectra using publicly available whole-genome polymorphism 
data from 2,504 diverse humans and 67 wild-caught Mus 
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Fic. 4.—An hmC-related CG>GG mutation signature distinguishes 
ERVs from other compartments. (A) Heatmaps of log odds of triplet mu- 
tation spectra comparing ERV to nonrepetitive heterochromatin compart- 
ments for each species show a significant ERV-specific CG>GG mutation 
signature. (B) Enrichment of CG>GG mutation types in ERVs with hmC 
compared with ERVs without. Points represent the fraction of each triplet 
mutation in ERVs with and without hmC calculated from SNVs segregating 
within each species (y and x axes, respectively). Mutation types that fall 
along the y = x line occur equally frequently in both compartments. 
utation type labels are included only for mutation types whose log ratio 
of ERV hmC+:ERV hmC— exceeds 0.4. Points are colored by species. (C) 
The CG>GG mutation signature in ERVs is robust to the size and shape of 
he ERV compartment. We created 100 “ERV-like” compartments by 
sampling segments corresponding to the size of those in the ERV com- 
partment from random locations within the nonrepetitive heterochroma- 
in compartment. The distribution of the log odds of CG>GG mutations 
between these ERV-like to the original nonrepetitive heterochromatin 
compartment is violin plots. The log odds of CG>GG mutations between 
he original ERV and nonrepetitive compartment are plotted as dots for 
reference, with 95% confidence interval. 








musculus and Mus spretus individuals (The 1000 Genomes 
Project Consortium 2015; Harr et al. 2016). Ancestral states 
for human polymorphisms were determined with regards to 
an inferred reconstructed genome representing the most re- 
cent common ancestor of humans; ancestral states for mouse 
were polarized based on an aligned rat reference genome 
(rn6) (The 1000 Genomes Project Consortium 2015). 
Polymorphisms were subject to filters similar to those applied 
to the great ape data: sites that either failed quality filters, 
were missing from >20% of haplotypes, or had a minor allele 
frequency of 1/2 N (e.g., singletons) were excluded from anal- 
ysis. Sites fixed in the sampled haplotypes were also excluded. 
We employed a similar randomization strategy to avoid struc- 
ture due to shared variation, but summed together the mu- 
tation spectra of humans from the same subpopulation and 
mice from the same sampling site and subspecies to avoid 
sparseness (N=26, 9 respectively; see Materials and 
Methods). 

Two separate matrices containing mutation spectrum data 
from all compartments in each respective species were 
decomposed using PCA, and we observed several common- 
alities between human and mouse in the spatial arrangement 
of the chromHMM compartments within the span of the first 
two principal components (fig. 5). Unsurprisingly, mouse spe- 
cies separate to a greater degree than do human populations, 
but within a subspecies the relative positioning of spectra 
from different functional compartments mirrors that observed 
in humans, especially after a roughly 30-degree rotation of 
the PC axes. In humans, PC1 largely separates promoters from 
other compartments, and PC2 separates transcribed regions. 
Enhancers and insulators cluster together, indicating mutation 
spectra more similar relative to other compartments. These 
same features are true of the mouse PCA, except that the 
vector separating promoters from other compartments is in- 
termediate between PC1 and PC2. Mutation spectra from the 
heterochromatin or repressed compartments cluster more 
closely to insulators and enhancers, respectively, in mouse 
and human, respectively; this phenomenon could be due to 
annotation differences. Furthermore, the loadings of PC1 and 
PC2 are significantly correlated between the two separate 
PCAs (Pearson's p = 0.486, 0.555, P=5.14 x 10-7, 4.44 x 
10-°, respectively) (supplementary fig. S18, Supplementary 
Material online). As expected, CpG transition rates are 
weighted heavily along PC1, likely reflecting the unique meth- 
ylation patterns that regulate promoter function, but other 
types of transitions and transversions also appear to have dif- 
ferent rates among these functional compartments. Both 
C>A and A>T, the mutation types associated with late- 
replication timing in great apes, appear negatively associated 
with PC2, a pattern that is consistent with the tendency of 
transcribed regions, which are positively displaced along PC2, 
to occur in early-replicating regions. Although ChromHMM- 
defined compartments are not necessarily orthologous be- 
tween human and mouse, their shared epigenetic markings 
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Fic. 5.—Structure of mutation spectrum variation is conserved between mouse and human. (A, B) We ran PCA separately for mutation spectrum data 
from mouse (A) and human (8) individuals, whose genomes were split into compartments annotated by chromHMM (PC1 of mouse reversed for readability). 
The relative positioning of compartment mutation spectra within a species or subspecies is similar, particularly after a roughly 30-degree rotation for the 
mouse PC1 and PC2. For both species, PC1 separates promoters from other compartments, whereas PC2 distinguishes transcribed regions. The loadings of 
PC1 and PC2 are significantly correlated between the two separate PCAs (rho = 0.486, 0.555, P= 5.14 x 107”, 4.44 x 107°, respectively) (supplementary 


fig. S18, Supplementary Material online). 


and functions appear to be enough to cause a common set of 
sequence motifs to be particularly vulnerable to mutation. 


Discussion 


Despite considerable documented evidence of mutation spec- 
trum variation across genomic space and phylogenetic time, 
little was previously known about the covariation of mutation 
spectra along spatial and temporal axes. Our results show that 
such covariation is negligible, at least in great apes: spatial and 
temporal mutation spectrum variation are largely orthogonal 
to one another. Replication timing, repeat content, and other 
functional categories have consistent mutational biases across 
all great ape species. At the same time, each species has a 
distinctive mutation spectrum bias that affects all functional 
compartments we have analyzed. There exist some excep- 
tions to this general rule, most notably in the compartments 
of the genome that accumulate a maternal-age-related sig- 
nature in humans that is attenuated in chimpanzees and 
gorillas and nearly nonexistent in orangutans. Nevertheless, 
our results show that mutation spectrum divergence between 
ape species is mostly driven by processes that act promiscu- 
ously across the genome. Determining how broadly this con- 
clusion applies to species beyond great apes will be an exciting 
avenue for future work. 

Our results provide evidence that mutation rates in late- 
replicating regions are elevated due to a mechanism whose 
activity pattern over great ape evolution has been largely 


conserved. The distinct signatures associated with different 
great ape species demonstrate that the simplest possible 
“hominoid slowdown” model is likely not sufficient to explain 
all differences between great ape mutation rates. On the one 
hand, some papers have suggested that increasing reproduc- 
tive age is enough to explain differences in mutation rates that 
have been inferred by analyses of the branch lengths of great 
ape phylogenies. However, we see no evidence that differ- 
ences between great ape species’ mutation spectra can be 
explained by varying the dosage of a single mutational signa- 
ture associated with increased parental age, or even separate 
signatures associated with paternal and maternal ages of the 
kinds that have been inferred from human DNM data. 

To understand why species-specific mutational signatures 
are not obviously biased toward particular genomic regions, it 
is helpful to consider previous theoretical work on the dynam- 
ics of alleles that drive mutation rate evolution (Sturtevant 
1937; Kimura 1967; Lynch 2008; Lynch et al. 2016). 
Although a genetic modifier of the mutation spectrum will 
not necessarily change the mutation rate, the most parsimo- 
nious scenario is for new mutations that alter the mutation 
spectrum to slightly increase the overall mutation rate by 
impairing the faithful replication of particular motifs. 
Kimura, Lynch, and others have noted that selection against 
alleles that increase the mutation rate is driven by selection 
against the excess deleterious variation that such alleles beget. 
However, most new variants created by a trans-acting muta- 
tor will be on different chromosomes or distant parts of the 
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same chromosome that will immediately recombine with 
other genetic backgrounds. The only deleterious mutations 
that are likely to cause selection against the mutator are muta- 
tions that happen to occur in a small window that maintains 
high linkage disequilibrium with the mutator locus. In con- 
trast, a mutator that affects mostly neighboring DNA will tend 
to stay linked to more of the deleterious mutations it creates, 
making it more susceptible to purifying selection and loss 
from the population. A trans-acting rate modifier that targets 
specific genomic regions might not experience strong linked 
selection itself, but the associated genomic targets might ex- 
perience selective pressure to stop attracting targeted muta- 
genic activity. 

Some species-specific signatures might be the footprints of 
environmental mutagens, but trans-acting genetic modifiers 
are more parsimonious explanations for signatures that affect 
larger clades of multiple species. The mutations we analyzed 
here are all segregating variants that originated long after 
modern ape species had become reproductively isolated, 
and environmental exposures are not likely to have respected 
phylogenetic boundaries for millions of years after the com- 
pletion of ape speciation. Fixed differences between polymer- 
ases, DNA repair factors, and/or their regulatory elements are 
more likely to be responsible for differences in mutation spec- 
tra that respect phylogenetic structure and act consistently 
across the genome. 

We have noted that all ape species exhibit some internal 
mutation spectrum substructure, with Western chimpanzees 
being the most distinctive subspecies. Western chimpanzees 
are different from other chimpanzee subspecies in several 
ways, including lower levels of bonobo gene flow, a higher 
load of transposable elements, and a stronger population bot- 
tleneck in their recent history. Both transposable elements 
and accelerated genetic drift may have hastened this lineage’s 
rate of mutation spectrum drift. 

Although selection, drift, and demographic history cer- 
tainly affect genome-wide genetic diversity and variation in 
diversity across the genome, these forces are not reasonable 
explanations for the patterns of mutation spectrum diver- 
gence we present in this manuscript. Biased gene conversion 
selects for mutations from A/T to G/C, but none of the mu- 
tational signatures that vary between compartments or spe- 
cies fit this simple profile. In coding regions, selection 
generally allows synonymous mutations to reach higher fre- 
quencies than nonsynonymous mutations, but coding regions 
comprise a nonexistent to negligible fraction of the various 
compartments analyzed here, and the universality of the ge- 
netic code prevents selection against nonsynonymous substi- 
tutions from generating any distinctive species-specific 
differences. It is generally not plausible to suppose that the 
same mutational signature would be selected for in a single 
lineage across all the noncoding genomic compartments we 
analyze here, as this would require mutations in many triplet 








contexts to have fitness effects that were somehow consistent 
across the whole genome. 

Although identifying the causal fixed differences still rep- 
resents a challenging unsolved problem, the insights from this 
paper will allow us to narrow the field of possible mutation 
spectrum modifiers to exclude ones that target only subsets of 
the genome. Focusing on ERVs in detail, we were able to use 
functional genomic annotations to link this compartment’s 
CG>GG mutational signature to hmC of CpG sites. 
Examination of a broader set of functional genomic data 
may facilitate the interpretation of other localized signatures 
and bring us closer to understanding their causality. 

Previous work estimated that 80% of spatial mutation rate 
variation could be explained by letting mutation rates depend 
on an extended 7-mer sequence context (Aggarwala and 
Voight 2016; Carlson, Locke, et al. 2018). As 7-mer compo- 
sition differs between genomic compartments, these ex- 
tended sequence context models likely derive some of their 
predictive power from the effects of cis-acting mutational 
modifiers. However, we have shown that compartment anno- 
tations provide extra information about mutability above and 
beyond what we can tell from extended sequence context 
alone, at least in the case of the ERV hmC signature. An im- 
portant avenue for future work will be to examine the con- 
verse possibility and determine how much of the dependence 
of mutability on extended sequence context can be explained 
by genomic compartmentalization. 

A small proportion of the genome is expected to vary in 
mutation rate between individuals due to the presence of 
rtQTLs, but our results suggest that the coarse shape of the 
replication timing landscape is stable across the great ape 
clade (Koren et al. 2012). The genomic distribution of the 
replication timing signature could even be leveraged to esti- 
mate the extent of TAD variation within and between species. 
More generally, if mutational signatures of chromatin states 
and various epigenetic modifications prove stable and inter- 
pretable over large phylogenetic clades, they represent a valu- 
able source of information about the evolution of chromatin 
structure and function in nonmodel organisms where only 
genome sequences are available. 





Materials and Methods 
SNV Filtering 


We ascertained mutation spectra from a set of high-coverage 
great ape SNVs that were previously called and filtered by 
Prado-Martinez et al. (2013). For each species and compart- 
ment, we collated the set of biallelic SNVs falling within the 
genomic segments that comprise that compartment. 
Ancestral states were assigned using a parsimony approach. 
Briefly, a biallelic site segregating within a genus (Homo, Pan, 
Gorilla, or Pongo) was polarized to the allele fixed in all other 
genera. Sites segregating in multiple genera, sites with 
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multiple fixed alleles, and sites with more than two alleles in a 
single genus were excluded due to their inconsistency with 
the assumptions of no balancing selection and only one mu- 
tation event per site. Singletons were also excluded due to 
their higher likelihood of sequencing error. We used the in- 
ferred ancestral base to classify 3’ and 5’ neighboring nucleo- 
tides. For 3-mer mutational analyses, we excluded SNVs 
whose 3’ and 5’ neighboring nucleotides were an ‘N’ in the 
hg18 reference and SNVs demonstrating evidence of recur- 
rent mutation in the great ape lineage; we expanded this filter 
to include the three 3’ and 5’ neighboring nucleotides for 7- 
mer mutational analyses. Finally, we removed SNVs out of 
Hardy-Weinberg Equilibrium with excess heterozygosity (us- 
ing an exact test, P< 0.05). Excess heterozygosity at a locus 
could indicate a cryptic segmental duplication with a single, 
fixed mutation in a copy. We also excluded SNVs with >0.5 
derived allele frequency to avoid mutational classes with an 
elevated risk of ancestral state misidentification. 


Computing the Mutation Spectra of Individuals and 
Species 


The PCAs in this paper require the computation of mutation 
spectra from individual genomes, whereas the complemen- 
tary heat map analyses involve calculating aggregate muta- 
tion spectra from larger samples. Each analysis employs the 
filtering system described above and ultimately involves 
counting the number of filtered derived alleles, classified by 
3-mer context. However, slightly different calculation details 
are involved in the two cases. 

The aggregate mutation spectrum of a species S is 
obtained from a set of counts Cim,, S),...,CiMg.6, S) where 
Mj,.-..Mg96 are the 96 3-mer mutation type categories 
AAAS C,..., TCT>T. The count C(m,, S) is the total number 
of SNVs segregating in species S that fall into the mutationa 
equivalence class m;. To compare spectra across samples with 
different amounts of variation, these mutation type counts 
are normalized to obtain a 96-dimensional histogram with 
frequency categories summing to 1. 

A mutation spectrum can be similarly calculated from a 
particular individual / as the distribution of 3-mer mutation 
types across the derived alleles present in /'s genome. 
Homozygous derived alleles are given twice the weight of 
heterozygous alleles such that the spectrum is the average 
of the spectra one would compute from the two phased 
haplotypes making up /'s diploid genome. 

When individual mutation spectra are computed in this 
way, two types of derived alleles can contribute to spectrum 
covariance between individuals / and J. The first type is pairs of 
derived alleles that occur at separate loci in / and J but belong 
to the same mutation equivalence class. The second type is 
derived alleles inherited by both / and / from a common an- 
cestor. To maximize our power to detect mutation spectrum 
evolution and distinguish it from shared genetic drift, we 





devised a randomization strategy to eliminate the second 
source of signal while preserving the first. 

This randomization strategy involves computing the muta- 
tion spectrum of individual / from only a subset of the derived 
alleles present in /’s genome. If one copy of a particular de- 
rived allele is present in /’s genome and has frequency k/2N in 
the GAGP panel, the allele will be counted toward /'s muta- 
tion spectrum with probability 1/k. Conversely, this derived 
allele will be counted toward the mutation spectrum of ex- 
actly one ape haplotype that carries it, with the identity of that 
haplotype chosen uniformly at random. 


Comparing Mutation Spectra across Genomic 
Compartments 


Comparing mutation spectra between regions of the genome 
required accounting for differences in compartment size and 
nucleotide content. Larger compartments naturally had more 
mutations than smaller ones; it was therefore necessary to 
compare mutation fractions rather than raw counts. 
Furthermore, differences in nucleotide content between com- 
partments could bias our comparison and calculation of local 
mutation rates. For example, a particular compartment could 
have a relatively high count of AAA>ACA SNVs, but this high 
count might be caused by the compartment having many 
occurrences of the triplet AAA, and therefore more opportu- 
nities for an AAA>ACA to occur. Thus, we rescaled the num- 
ber of mutations for each compartment by the nucleotide 
content of the NCNR compartment before calculating frac- 
tions. To calculate the rescaled rate rm) of mutation m;: 
{mj,..., Mg6} corresponding to triplet t {t;,..., ts2} and 
compartment C: 








TENCNR 
Ric = Mic 
tC 
Ric 
(im) == 
dy Ric 


We calculated triplet content of each compartment by sliding 
a 3-bp window, 1bp at a time, across each compartment. 
Triplets whose central mutation was contained within the 
boundaries of a compartment segment but whose 3’ or 5/ 
flanking mutations fell beyond and triplets with N’s were 
excluded. 

Several statistical analyses comparing two different muta- 
tion spectra required count data rather than frequency data 
(e.g., Chi-square tests). We devised a slightly different rescal- 
ing strategy for these counts to avoid artificially inflating the 
mutation counts. To calculate the rescaled count of mutation 
mi: {M1,..., Mg6} Corresponding to triplet t: {t;,..., t32} and 
compartment C; in preparation for comparison to compart- 
ment C> we scaled down the raw count of mutation m in the 
compartment where m is more abundant, rather than scaling 
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up the count of m in the compartment where it is more 
abundant: 


tt ie) 
Mic, * rae = tha 
Rig _— tq 


Mic stecg < tig 


The same rescaling is used for compartment C3, switching the 
subscripts accordingly. 


Statistical Analyses 


We generated plots and performed statistical analyses in R 
(version 3.1.0) using scripts available at https://github.com/ 
harrispopgen/gagp_mut_evol. 

We ran PCAs on matrices (k x C rows by 96 columns, 
for k individuals, and C compartments) of rescaled 3-mer 
mutation rates calculated for each individual and each com- 
partment using the prcomp method. Some PCAs were run 
on individuals from all species; others were only run on 
individuals from a single species. The matrices were cen- 
tered and scaled, as is standard for prcomp. The PCA load- 
ing heatmaps display the weights associated with each of 
the 96 3-mer mutation types for a given PC. The Euclidean 
distance ridge plots were similarly generated with individ- 
ual mutation spectrum data that required no rescaling be- 
cause the analysis considered only a single compartment. 
We plotted the distribution of Euclidean distances based on 
a2 x 96 matrix comparing the mutation counts between 
two different individuals of either a single or two different 
species. For comparisons within species, we ran k choose 2 
tests (k being the number of individuals for a given species); 
for comparisons between two species, we ran k x / tests (k 
and / being the numbers of individuals in both species, 
respectively). 

The log-odds heatmaps were generated to display the 
relative enrichment or depletion of specific mutation types 
when comparing two compartments directly to each other. 
For a given species, we plotted the log transform of the 
ratio between the rescaled mutation rates of two 
compartments. 

The 7-mer content-corrected heatmap required 7-mer 
mutation and nucleotide content from various compartments, 
which were generated using the 3-mer mutation and nucle- 
otide methods and equally expanding context on the 5’ and 3’ 
side of the central base. Each original 3-mer mutation m3 
{AAA>ACA, AAA>AGA ... TCT>TTT} is a collapsed equiv- 
alence class of 256 unique 7-mer mutations m7 ; \ 
{AAAAAAA>AAACAAA, AAAAAAC>AAACAAC, 
TTAAATT>TTACATT}. To explicitly re-weight the counts of 
each 3-mer mutation m3 ; cin compartment C using the ratio 
of 7-mer content in Cs, c {AAAAAAA, AAAAAAC, ... 
TTTCTTT} to that of compartment C’: 








Ss,c7 
Rric = >> MAC 
n SC 


s)he 
r(m3,) = > Rajc 


We used this method to rescale 3-mer mutations from the 
ERV compartment to match the 7-mer content from the non- 
repetitive heterochromatin compartment; the heatmap in 
supplementary figure S17, Supplementary Material online 
presents the log ratio of the rescaled ERV mutation and the 
nonrescaled nonrepetitive heterochromatin mutation spectra. 
We ran correlation analyses to quantify the similarities 
between the mutation spectrum heatmaps between species. 
Each mutation spectrum heatmap comprised the log ratio 
between each of the 96 mutation types between two com- 
partments for a single species. To calculate the similarity be- 
tween heatmaps for two different species, we ran a Pearson 
correlation test on the paired vectors of log odds. 


Compartments 


We defined compartments based on published annotations 
of genomic features. Each compartment was a list of genomic 
segments in a bed file format. The following is a list of the 
compartments used in our analyses and a short description of 
how we generated them. 


Nonconserved and Nonrepetitive 


The entire hg18 genome, excluding repetitive elements de- 
fined by repeatMasker, conserved regions in primates based 
on the phastCons 44-way multispecies alignment, CpG 
islands from the UCSC genome browser, and coding exons 
from refGene. 


ERVs 


All ERVs in repeatMasker run on hg18, excluding those clas- 
sified specifically as mammalian long-terminal repeats 
(MaLRs). MaLRs are believed to be largely inactive in great 
apes, unlike ERVs which are still active in several species. 


LINEs 
All LINEs in repeatMasker run on hg18 


Heterochromatin 


The intersection of the heterochromatin domain called in the 
hg18 chromHMM run on 9 different cell types from ENCODE 
(Gm12878, H1 HESCs, HepG, Hmec, Hsmm, Huvec, K562, 
Nhek, and Nhlif), minus repetitive elements defined in 
repeatMasker. 
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Early-/Late-Replicating Regions 


The genomic quartiles that replicate earliest and latest during 
S phase were ascertained using replication timing data from 
Koren et al. (2012). In that manuscript, the fine-scale repli- 
cation timing of regions in the genome was determined by 
sequencing human lymphoblastoid cell lines at S1/G phase; 
read depth over a region in the genome corresponded to its 
average relative replication timing. The read depths were 
measured at specific genomic positions. We calculated aver- 
age replication timing for nonoverlapping 20-kb window 
that included at least one measurement of replication timing 
and calculated replication timing quartiles. The earliest and 
latest quartiles were used as compartments. Early-/late-rep- 
licating, repetitive/nonrepetitive regions were the sub- 
sets of the replication timing compartments that 
overlapped with or excluded repeatMasker-annotated 
repeats, respectively. 


Human Maternal Mutation Hotspots 


Human maternal mutation hotspots defined in Jonsson et 
al. (2017) as regions whose DNM rate strongly associates 
with maternal age, lifted over to hg18 (Jonsson et al. 
2017). 


ERVs = 5hAMC 


ERV compartment, split into segments that had or lacked ev- 
idence of > 1 ShmC site, based on Tet-assisted bisulfite se- 
quencing of human ESCs (Yu et al. 2012). 


Quality Control Analyses 


We ran a number of analyses to test the robustness of our 
methods and findings. 


Determining How GC-Biased Gene Conversion Contributes 
to the Separation of Species and Compartments in PCA 
Plots 


GC-biased gene conversion (gBGC) is the nonreciprocal 
copying of short DNA tracts between homologous chromo- 
somes during meiosis, which at heterozygous sites tends to 
retain G/C (strong, S) alleles more often than A/T (weak, W) 
alleles (Holmquist 1992; Eyre-Walker 1993, 1999). This pro- 
cess causes G/C derived alleles to have higher substitution 
rates and frequencies than A/T alleles. If the strength of 
gBGC were not conserved across the great ape phylogeny, 
we might expect species’ and compartments’ mutation spec- 
tra to separate along an axis defined by the ratio of W-to-S 
and S-to-W mutation types, with species and compartments 
that experience the most gene conversion having the highest 
proportions of S-to-W variants. Such a gradient would be 
expected to dissipate, however, if we constructed mutation 
spectra using a higher proportion of rare variants, which are 


younger than common variants and have had less time to be 
influenced by the effects of gBGC. To this end, we tabulate 
mutation spectra in the following way: each SNV present in k 
haplotypes is counted toward the mutation spectrum for 
only one randomly selected haplotype; this process is de- 
scribed above (“computing the mutation spectra of individ- 
uals and species”) and hereafter referred to as 
“randomization.” We observe that individuals cluster by spe- 
cies in a PCA run on NCNR mutation spectra whether we 
employ randomization or count each SNV toward every hap- 
lotype on which the derived allele appears. Furthermore, the 
principle component loadings are not consistent with the 
expected signature of gBGC. Many of the mutation types 
that have different mutation fractions in different ape species 
are W-to-W or S-to-S mutation types that are expected to be 
unaffected by gBGC (supplementary fig. S2, Supplementary 
Material online). Even if we compute mutation spectra en- 
tirely from rare doubleton variants, we observe similar spe- 
cies separation to what we observe using more common 
variants, despite these rare variants being more weakly af- 
fected by gBGC. 

The rates of gBGC covary across the genome with recom- 
bination rate. The locations of recombination and gBGC 
hotspots change rapidly and are often species-specific due 
to evolution of the gene PRDM9. Thus, comparing mutation 
spectra between species in locations where gBGC rates are 
most divergent should 1) indicate an upper bound on the 
effects of gBGC rate evolution on our mutation spectrum 
analyses and 2) test our capacity to minimize its effects 
through the randomization method. We therefore examined 
the mutation spectra of species-specific recombination hot- 
spots with and without our random sampling method (sup- 
plementary fig. $1, Supplementary Material online). These 
analyses showed that, without random sampling, differences 
in mutation spectra within recombination hotspots are in 
fact dominated by the cis-acting effects of gBGC. 
Supplementary Figure S1A, Supplementary Material online 
shows a PCA of individual mutation spectra from the 
NCNR compartment and two compartments containing 
the genomic regions whose recombination rates fall within 
the upper and lower genome-wide deciles in humans (Kong 
et al. 2010); this spectrum is not thinned by randomization. 
We can clearly see the mutation spectra cluster by species, 
but we can also see high-recombination rate and low- 
recombination rate compartments separate along PC2, espe- 
cially in humans where the ascertained recombination hot- 
spots are active. PC2’s loadings are dominated by A>G and 
A>C mutations, which are both W>S mutations and likely 
indicate a human-specific enrichment for gBGC. In contrast, 
the separation between high recombination and NCNR com- 
partments is mostly attenuated in supplementary figure S1B, 
Supplementary Material online with the use of the random- 
ization method. 
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Comparing SNV to DNM Spectra 


Several papers have recently reported homogeneity among 
the DNM spectra of great apes. Our analyses, however, dem- 
onstrate that mutation spectra generated from SNVs vary sig- 
nificantly between species. We therefore compare our SNV to 
DNM spectra from Besenbacher et al. (2019) to assess the 
difference in results and find that DNM spectra are under- 
powered to detect the species signatures we find in SNV 
spectra. 

We test the likelihood that the distribution of the observed 
DNM counts in an individual is pulled from the SNV spectrum 
of a given species s. We calculate the expected probabilities P 
in [Pasc s Pase s--- Pest sl, given the number of SNVs 
segregating in species s in the GAGP of mutation type m; ; 
in [Masc Mase, --- Mcs7] by the following equation: 

Mis 


7 YMs 





Pi.s 


We then calculate the log likelihood that the spectrum of 
DNMs for individual x (m; ,) assuming a multinomial distribu- 
tion. This likelihood represents how well the DNM spectrum 
for an individual fits the SNV spectrum of a given species. 


Psy = Mix*log (pis). 
i 


To determine whether an SNV spectrum fit a DNM spectrum 
significantly better than others, we calculated the significance 
of the differences in likelihoods for a given individual when 
compared with chimpanzees, orangutans, and gorillas (x: [C 
O, G]). We simply calculated the fold range in fit as: 


exp(range(Pcx, Pox, Pex)). 


A “significance” threshold was set at a fold range of 20x, to 
replicate a standard « value of 0.05. The table below shows 
that none of these values approach 20 (supplementary table 
S1, Supplementary Material online). We conclude that the 
DNM spectra are too sparse to demonstrate clear species 
differences. 


Testing the Robustness of PCA Clustering to Species 
Representation 


We tested the robustness of the clustering of individuals by 
species in the NRNC PCA to differences in number of individ- 
uals sequenced per species. We down-sampled the number 
of individuals per species to match those of humans (n = 9), 
excluding the two orangutan species who were grouped as a 
single super-species group for this analysis. The clustering 
patterns in the PCA remained. 

We recreated the several plots from figures 1 and 2 using a 
rarer subset of variants and found no qualitative differences 
when compared with the original figures (supplementary fig. 


S6A, Supplementary Material online). To avoid the potential 
quality issues related to relying on singletons, we used only 
doubletons (DAF = 2/2 N, for N individuals sequenced from 
each species) in this analysis. The PCA of the individual muta- 
tion spectra across all species in the NCNR compartment alone 
still demonstrates the clustering of individuals by species (sup- 
plementary fig. S6B, Supplementary Material online). The dis- 
tribution of Euclidean distances between the NCNR mutation 
spectra of individuals within and between species still dem- 
onstrates the correlation of mutation spectrum distance with 
divergence time (supplementary fig. S68, Supplementary 
Material online). The PCA of individual mutation spectra 
from the NCNR, early-replication timing, and late-replication 
timing compartments still demonstrates similar trends as ob- 
served in figure 2D (supplementary fig. S6C, Supplementary 
Material online). The first PC, representing the greatest axis of 
variance among all spectra, separates spectra according to 
phylogeny and represents species signatures; the second PC 
separates compartments by their replication timing. The load- 
ings for PC2 are enriched for C>A and A>T mutation types, 
corresponding with the known late-replication timing signa- 
ture (supplementary fig. S6D, Supplementary Material online). 
We attribute the spread observed in chimpanzees in the rep- 
lication timing doubleton PCA below to be largely an effect of 
noise imposed by down-sampling. 


Testing the CG>GG Signature in ERVs 


We determined the enrichment for CG>GG mutations in 
ERVs compared with nonrepetitive heterochromatin was un- 
affected by the differences in mapping quality between the 
two compartments, noting that mutations in repetitive 
regions are more difficult to call confidently. To determine 
the potential confounding effect of mapping quality on the 
CG>GG signature, we compared the distribution of the map- 
ping quality value of the variants in the ERV and nonrepetitive 
heterochromatin compartments (MQ field in the GAGP vcf). 
Density distributions of the mapping qualities for the two 
compartments were highly overlapping within each species 
(supplementary fig. S20, Supplementary Material online). 
We also determined that the CG>GG signature was un- 
affected by the different “shapes” of the ERV and nonrepe- 
titive heterochromatin compartments, that is the distribution 
of segment lengths and overall size of compartment. For each 
ERV compartment segment for a given chromosome, we 
reassigned the coordinates randomly within the nonrepetitive 
heterochromatin compartment, preserving segment length. 
In the event a randomized compartment was chosen to over- 
lap one or more “N” bases, a new compartment was 
resampled. We did not filter for overlapping randomized seg- 
ments, but assumed that, given that the ERV compartment 
was a fraction of the size of the heterochromatin compart- 
ment (127 and 430 Mb, respectively), collisions would be un- 
likely enough to that their sparsity would not bias our 
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findings. This randomization process to create “ERV-like” 
compartments was bootstrapped 100 times. We then calcu- 
lated nucleotide content and mutation spectra for each of the 
100 bootstrapped compartments. We generated the log- 
odds of the CG>GG mutation type between the boot- 
strapped compartments and nonrepetitive heterochromatin 
compartment (each normalized to each other, using the 
Chi-square normalization method as described above), then 
compared those values to the same statistic of the original 
two compartments (fig. 4C) for all species. The observed 
CG>GG enrichment in the ERV compartment lies outside of 
the null distribution in all species except Bornean orangutans, 
for which the confidence interval of the observed log-odds 
overlaps the null distribution. Given that the observed 
CG>GG fraction still falls outside of the null distribution, 
we conclude that Bornean orangutans show the same trend 
as seen in other species, but that trend is not significant on its 
own. This lower enrichment in Bornean orangutans may be a 
result of their low sample size (2N= 10) and the paucity of 
observed segregating variation (65% that of Sumatran orang- 
utans, 30% that of gorillas, e.g.). The enrichment for CG>GG 
mutations comparing ERVs to nonrepetitive heterochromatin 
is significantly stronger than the enrichment in any of the 
bootstrapped, “ERV-like” compartments (supplementary ta- 
ble S6, Supplementary Material online). 

We tested the effect of species-specific nucleotide content 
on our rescaling method. The GAGP data are aligned to hg18; 
therefore, mutations in genomic regions in a nonhuman spe- 
cies that do not exist or do not map well in humans (e.g., new 
repetitive elements) were absent from our analyses. Our mu- 
tation rescaling method, however, relies on compartment nu- 
cleotide content determined from the hg18 reference, 
therefore including regions specific to humans and absent 
in other species. We compared the log-odds of each mutation 
type between the ERV and nonrepetitive heterochromatin 
compartment in all six species with those calculated by rescal- 
ing mutation counts using the nucleotide content of the seg- 
ments of a given compartment that successfully lifted over to 
each respective species (lifted from hg18 to gorGor4, 
panPan1, panTro2, and ponAbe2 for gorilla, bonobo, chim- 
panzee, and both orangutans, respectively, using default 
liftOver settings). The log-odds values within species are highly 
significantly correlated (all p > 0.95, supplementary fig. S19, 
Supplementary Material online). 





Supplementary Material 


Supplementary data are available at Genome Biology and 
Evolution online. 
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